Goto

Collaborating Authors

 subspace recovery




Subspace Recovery from Heterogeneous Data with Non-isotropic Noise

Neural Information Processing Systems

Recovering linear subspaces from data is a fundamental and important task in statistics and machine learning. Motivated by heterogeneity in Federated Learning settings, we study a basic formulation of this problem: the principal component analysis (PCA), with a focus on dealing with irregular noise. Our data come from $n$ users with user $i$ contributing data samples from a $d$-dimensional distribution with mean $\mu_i$. Our goal is to recover the linear subspace shared by $\mu_1,\ldots,\mu_n$ using the data points from all users, where every data point from user $i$ is formed by adding an independent mean-zero noise vector to $\mu_i$. If we only have one data point from every user, subspace recovery is information-theoretically impossible when the covariance matrices of the noise vectors can be non-spherical, necessitating additional restrictive assumptions in previous work. We avoid these assumptions by leveraging at least two data points from each user, which allows us to design an efficiently-computable estimator under non-spherical and user-dependent noise. We prove an upper bound for the estimation error of our estimator in general scenarios where the number of data points and amount of noise can vary across users, and prove an information-theoretic error lower bound that not only matches the upper bound up to a constant factor, but also holds even for spherical Gaussian noise. This implies that our estimator does not introduce additional estimation error (up to a constant factor) due to irregularity in the noise. We show additional results for a linear regression problem in a similar setup.




Global Convergence of Iteratively Reweighted Least Squares for Robust Subspace Recovery

arXiv.org Machine Learning

Robust subspace estimation is fundamental to many machine learning and data analysis tasks. Iteratively Reweighted Least Squares (IRLS) is an elegant and empirically effective approach to this problem, yet its theoretical properties remain poorly understood. This paper establishes that, under deterministic conditions, a variant of IRLS with dynamic smoothing regularization converges linearly to the underlying subspace from any initialization. We extend these guarantees to affine subspace estimation, a setting that lacks prior recovery theory. Additionally, we illustrate the practical benefits of IRLS through an application to low-dimensional neural network training. Our results provide the first global convergence guarantees for IRLS in robust subspace recovery and, more broadly, for nonconvex IRLS on a Riemannian manifold.


Subspace Recovery in Winsorized PCA: Insights into Accuracy and Robustness

arXiv.org Machine Learning

In this paper, we explore the theoretical properties of subspace recovery using Winsorized Principal Component Analysis (WPCA), utilizing a common data transformation technique that caps extreme values to mitigate the impact of outliers. Despite the widespread use of winsorization in various tasks of multivariate analysis, its theoretical properties, particularly for subspace recovery, have received limited attention. We provide a detailed analysis of the accuracy of WPCA, showing that increasing the number of samples while decreasing the proportion of outliers guarantees the consistency of the sample subspaces from WPCA with respect to the true population subspace. Furthermore, we establish perturbation bounds that ensure the WPCA subspace obtained from contaminated data remains close to the subspace recovered from pure data. Additionally, we extend the classical notion of breakdown points to subspace-valued statistics and derive lower bounds for the breakdown points of WPCA. Our analysis demonstrates that WPCA exhibits strong robustness to outliers while maintaining consistency under mild assumptions. A toy example is provided to numerically illustrate the behavior of the upper bounds for perturbation bounds and breakdown points, emphasizing winsorization's utility in subspace recovery.


Nested subspace learning with flags

arXiv.org Machine Learning

Many machine learning methods look for low-dimensional representations of the data. The underlying subspace can be estimated by first choosing a dimension $q$ and then optimizing a certain objective function over the space of $q$-dimensional subspaces (the Grassmannian). Trying different $q$ yields in general non-nested subspaces, which raises an important issue of consistency between the data representations. In this paper, we propose a simple trick to enforce nestedness in subspace learning methods. It consists in lifting Grassmannian optimization problems to flag manifolds (the space of nested subspaces of increasing dimension) via nested projectors. We apply the flag trick to several classical machine learning methods and show that it successfully addresses the nestedness issue.


Subspace Recovery from Heterogeneous Data with Non-isotropic Noise

Neural Information Processing Systems

Recovering linear subspaces from data is a fundamental and important task in statistics and machine learning. Motivated by heterogeneity in Federated Learning settings, we study a basic formulation of this problem: the principal component analysis (PCA), with a focus on dealing with irregular noise. Our data come from n users with user i contributing data samples from a d -dimensional distribution with mean \mu_i . Our goal is to recover the linear subspace shared by \mu_1,\ldots,\mu_n using the data points from all users, where every data point from user i is formed by adding an independent mean-zero noise vector to \mu_i . If we only have one data point from every user, subspace recovery is information-theoretically impossible when the covariance matrices of the noise vectors can be non-spherical, necessitating additional restrictive assumptions in previous work.


Theoretical Guarantees for the Subspace-Constrained Tyler's Estimator

arXiv.org Machine Learning

This work analyzes the subspace-constrained Tyler's estimator (STE) [12] designed for recovering a low-dimensional subspace within a dataset that may be highly corrupted with outliers. It assumes a weak inlier-outlier model and allows the fraction of inliers to be smaller than a fraction that leads to computational hardness of the robust subspace recovery problem. It shows that in this setting, if the initialization of STE, which is an iterative algorithm, satisfies a certain condition, then STE can effectively recover the underlying subspace. It further shows that under the generalized haystack model, STE initialized by the Tyler's M-estimator (TME), can recover the subspace when the fraction of iniliers is too small for TME to handle.